Abstract


One significant challenge to scaling entity resolution algorithms to massive 
datasets is understanding how performance changes after moving beyond the 
realm of small, manually labeled reference datasets. Unlike traditional machine 
learning tasks, when an entity resolution algorithm performs well on small hold-out
datasets, there is no guarantee this performance holds on larger hold-out
datasets. We prove simple bounding properties between the performance of a 
match function on a small validation set and the performance of a pairwise entity 
resolution algorithm on arbitrarily sized datasets. Thus, our approach enables optimization
of pairwise entity resolution algorithms for large datasets, using a small
set of labeled data. 


1 Introduction 

Entity resolution (ER) is the task of identifying records belonging to the same entity (e.g. individual,
product) across one or multiple datasets. Ironically, it has multiple names: deduplication and record
linkage, among others [1]. For example, ER is used to disambiguate shopping products, merge
datasets of users from disparate sources, or even profile potential terrorist threats. With the use of
blocking techniques, entity resolution can be scaled to many millions of records.

The canonical example in Table 1 illustrates the usefulness of pairwise ER for these application
domains. Initially, the match function may only predict $r_1 \approx r_2$ using the common phone number,
where $\approx$ denotes a match. A partial name may not be a strong enough commonality to predict that either
of these individually matches $r_3$. However, the merge of these records $\langle r_1, r_2 \rangle$, where $\langle \cdot \rangle$ denotes a
merge, provides the full name ‘John Doe’ and enables correctly merging all three records.
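
The example can be sketched in a few lines of Python. This is only an illustration: the `match` rule below (shared phone number, or both name fields sharing a value) and the dictionary-of-sets record layout are assumptions made for the sketch, not the match function used later in the paper.

```python
# Records from Table 1; each field holds a set of observed values.
r1 = {"name1": {"John"}, "name2": {"D."}, "phone": {"377-8328"}}
r2 = {"name1": {"J."}, "name2": {"Doe"}, "phone": {"377-8328"}}
r3 = {"name1": {"John"}, "name2": {"Doe"}, "phone": set()}


def match(a, b):
    """Hypothetical rule: shared phone number, or both name fields share a value."""
    if a["phone"] & b["phone"]:
        return True
    return bool(a["name1"] & b["name1"]) and bool(a["name2"] & b["name2"])


def merge(a, b):
    """Merge two records by taking the union of each field's values."""
    return {key: a[key] | b[key] for key in a}


r12 = merge(r1, r2)
print(match(r1, r3), match(r2, r3), match(r12, r3))
```

Under this assumed rule, neither original record matches `r3` on its own, while the merged record does.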

To design an effective entity resolution system, one would optimize over the ER merge and match 
functions. One might be tempted to evaluate and optimize an ER system on a small dataset with 
known labels, and then extend this to real-world applications. We stress that performance on small 
datasets does not necessarily imply similar performance on large datasets. Unlike more traditional 
machine learning tasks, in ER applications the number of entities often scales linearly with the size 
of the dataset m. This is not true in other clustering problems, where the number of clusters is 
typically constant or sublinear with the dataset size - a significantly easier problem. Further, the ‘no
negative evidence’ assumption can cause a ‘snowball effect,’ wherein several false positives
trigger many more clusters to merge, leading to a severe degradation in performance.

Table 1: Canonical Entity Resolution Example

Record   Name1   Name2   Phone
r_1      John    D.      377-8328
r_2      J.      Doe     377-8328
r_3      John    Doe

[Figure 1 plot title: Precision Degradation]



Figure 1: A simple experiment demonstrates the potential degradation of pairwise precision as the 
size of the dataset increases. Here the ‘Original’ algorithm (dashed line) was tuned for optimal 
performance on a training set of 100 records. ‘Optimized Lower Bound’ (solid line) shows our 
results after instead optimizing model parameters over the larger dataset’s estimated lower bound. 
‘True’ (dotted line) shows the actual performance corresponding to this lower bound. 


Consider the simple example in Figure 1, using synthetic data described in Section 5.1. First, we
learned a match function using a small training dataset of 100 records. On a test dataset of comparable
size, it achieved near perfect pairwise precision and recall. However, as we added new entities to
the test dataset, pairwise precision significantly degraded - an extreme example of the entire dataset
snowballing into a single entity. More importantly, near perfect performance on the larger datasets
was possible (dotted line), just with different match function parameters. Using our approach to instead
optimize over the larger dataset’s estimated lower bound dramatically improves performance
on the large set (solid line).

Although performance on a small labeled dataset does not directly equate to performance on an 
actual larger dataset, some useful information does exist which we will leverage into an estimated 
lower bound for ER performance on arbitrarily sized problems. Then, optimization of the estimated 
lower bound allows tuning of pairwise ER systems for large datasets. 

In this paper, our contributions are: 

1. Theoretical Performance Bounds: We prove simple, estimated lower bounds on pairwise
recall, precision, and $F_1$ performance metrics for arbitrarily sized datasets, under reasonable
assumptions and given a small number of labeled record pairs.

2. Empirical Tightness: We evaluate the bounds on one synthetic and three real world datasets 
to demonstrate the theoretical bounds are tight to the true performances. 

3. Optimal Merge Function: Given any match function, we prove a lower-bound optimal
merge function and ‘wrapper’ for the match function. This conservative strategy is equivalent
to finding all connected components, a key insight of the simple bounds.

The remainder of the paper is organized as follows. We begin Section 2 with a quick overview of
related work in the field of entity resolution. In Sections 3 and 4 we derive the estimated lower
bounds and optimal merge function, respectively. Lastly, in Section 5 we demonstrate the empirical
tightness of the bound on real world datasets.

2 Related Work


Entity resolution encompasses a broad set of approaches, including many adapted from the machine
learning, optimization, and graph theory domains. Strategies appropriate for ER include
hierarchical clustering, integer linear programming, latent Dirichlet allocation, pairwise
match/merge [4], Markov logic, and hybrid human-machine systems [9]. Pairwise entity resolution
approaches are appealing because they use an intuitive and easy to implement iterative match
and merge process between pairs of records. Further, under certain assumptions, pairwise algorithms
will perform the optimal number of record comparisons.

Perhaps the most general framework for pairwise entity resolution was presented by Benjelloun et
al. They outlined a theoretically disciplined approach, wherein certain properties of the match
and merge function guarantee a deterministic output in the optimal number of record comparisons.
We explore the use of some of these properties in the derivation of our bounds. Collectively, these
properties are referred to by their acronym ICAR:

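1. Idempotence: $\forall r$, $r \approx r$ and $\langle r, r \rangle = r$.

2. Commutativity: $\forall r_1, r_2$, $r_1 \approx r_2$ iff $r_2 \approx r_1$, and if $r_1 \approx r_2$, then $\langle r_1, r_2 \rangle = \langle r_2, r_1 \rangle$.

3. Associativity: $\forall r_1, r_2, r_3$ such that $\langle r_1, \langle r_2, r_3 \rangle \rangle$ and $\langle \langle r_1, r_2 \rangle, r_3 \rangle$ exist,
$\langle r_1, \langle r_2, r_3 \rangle \rangle = \langle \langle r_1, r_2 \rangle, r_3 \rangle$.

4. Representativity: If $r_3 = \langle r_1, r_2 \rangle$ then for any $r_4$ such that $r_1 \approx r_4$, we also have $r_3 \approx r_4$.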

The first three properties are straightforward and reasonable to assume for most ER systems. The 
crux of determinism falls on the final property, representativity. We, too, will take advantage of 
this convenient property, leaving the interesting problem of how relaxing this assumption affects the 
performance bounds for future work. Intuitively, representativity means merging any two records 
can only monotonically increase their chance of matching with other records. This is also referred 
to as the ‘no negative evidence’ clause. 

3 Lower Bounds of Performance 

Although many metrics exist to evaluate entity resolution performance when a ground truth dataset 
is available, this is rarely the case. Not surprisingly, human-generated clusterings rarely number
beyond a thousand records - a relatively easy ER problem. Even finding publicly available
datasets with ground truth so that we could objectively evaluate our results was a trying task. 

In the simplest setting, we assume we have access to some pairs with known binary match/mismatch
label $y$, such that $x \overset{\text{iid}}{\sim} p(x \mid y)$. For large datasets, finding all records belonging to one entity is a
worst-case combinatorial problem, but finding just two matching records is relatively easy using a hybrid
human-machine system [9] or with strong features (e.g. phone number, product ID).

With both match and mismatch pairs at our disposal, we created a training and validation set of 
labeled pairs. The remaining records form the test dataset. Note the training and validation sets will 
likely have significantly different class balance, cluster sizes, and overall number of samples than 
the test set. Though an entity resolution algorithm may perform well on the validation set with few 
samples and small cluster sizes, this may not indicate strong performance on the full dataset with 
millions of records and many more clusters. In practice, a developer needs to know performance 
guarantees of the test set because this is the deployed system. 

Here, we derive precise relationships between the performance of the match function on the validation
record pairs and estimated lower bounds on ER pairwise precision, recall, and $F_1$ on the test
set. Our notation for the following proofs, which the reader may find convenient to refer back to, is:

$\langle r_i, r_j \rangle$   Record formed by merging records $r_i$ and $r_j$.

$V$   Set of validation record pairs with known labels, $V = \{(r_1, r'_1), \ldots, (r_m, r'_m)\}$.

$V_S$   Set of record pairs in the validation set with positive label,
      $V_S = \{(r_i, r'_i) : y_i = 1, \forall (r_i, r'_i) \in V\}$.

$V_M$   Set of record pairs in the validation set that are predicted to directly match,
      $V_M = \{(r_i, r'_i) : r_i \approx r'_i, \forall (r_i, r'_i) \in V\}$.

$T$   Set of test records $\{r_1, \ldots, r_n\}$.

$T_M$   Set of record pairs in the test set that are predicted to directly match,
      $T_M = \{(r_i, r_j) : r_i \approx r_j, i < j, \forall r_i, r_j \in T\}$.

$R$   Set of record pairs in the entity resolution clustering of the test set.

$S$   Set of record pairs in the true clustering of the test set (unknown).

$\mathrm{Prec}(R, S)$   Precision of predicted and true positive pairs, $\mathrm{Prec}(R, S) = |R \cap S| / |R|$.

$\mathrm{Recall}(R, S)$   Recall of predicted and true positive pairs, $\mathrm{Recall}(R, S) = |R \cap S| / |S|$.

$C_V$   Class balance of pairs in the validation set, $C_V = |V_S| / |V|$.

$C_T$   Estimated class balance of pairs in the test set, $C_T = |S| / |\mathrm{Pairs}(T)|$.
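
As a small illustration of the pairwise metrics defined above, the following Python sketch builds the pair sets $R$ and $S$ from two clusterings and evaluates $\mathrm{Prec}(R, S)$ and $\mathrm{Recall}(R, S)$. The function names and the convention of returning 1.0 for an empty denominator are assumptions made for this sketch.

```python
from itertools import combinations


def cluster_pairs(clusters):
    """Return the set of unordered record pairs that share a cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs


def pairwise_precision(R, S):
    """|R intersect S| / |R|; returns 1.0 by convention when R is empty (assumption)."""
    return len(R & S) / len(R) if R else 1.0


def pairwise_recall(R, S):
    """|R intersect S| / |S|; returns 1.0 by convention when S is empty (assumption)."""
    return len(R & S) / len(S) if S else 1.0


# Toy clusterings over record ids 1..4.
predicted = [{1, 2, 3}, {4}]
truth = [{1, 2}, {3, 4}]
R = cluster_pairs(predicted)
S = cluster_pairs(truth)
print(pairwise_precision(R, S), pairwise_recall(R, S))
```

In this toy case only the pair (1, 2) is shared, so precision is 1/3 and recall is 1/2.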


Lemma 1. For entity resolution systems satisfying the representativity property, every record pair
that directly matches will end up in the same entity.

$$T_M \subseteq R. \tag{1}$$

Additional pairs in $R$ can occur from chains of matches (i.e. $r_1 \approx r_2$, $r_2 \approx r_3$, thus $(r_1, r_3) \in R$)
and from merging (see Table 1). However, we are unable to make strong claims about the additional
matches since composite records do not occur in the validation set.


Proof. Suppose on the contrary there exists a pair of records $(r_1, r_2)$, such that $(r_1, r_2) \in T_M$ but
$(r_1, r_2) \notin R$. In other words, $r_1 \approx r_2$ and they are resolved to separate entities $I_1 = \langle r_1, \ldots \rangle$ and
$I_2 = \langle r_2, \ldots \rangle$. Since these clusters were not merged in the ER process, $\langle r_1, \ldots \rangle \not\approx \langle r_2, \ldots \rangle$, which
contradicts the representativity property. □


Theorem 1. The pairwise precision of an entity resolution result can be lower bounded by:

$$E[\mathrm{Prec}(R, S)] \geq \frac{|T_M|}{|R|} \left( \frac{C_T (1 - C_V)\, E[\mathrm{Prec}(V_M, V_S)]}{C_V (1 - C_T) + (C_T - C_V)\, E[\mathrm{Prec}(V_M, V_S)]} \right) \tag{2}$$


The bound is composed of two parts. $|T_M| / |R|$ is the fraction of record pairs in the test set entity
resolution that directly match, which we can make stronger claims about. $\mathrm{Prec}(V_M, V_S)$ is the
precision of these direct matches, adjusted for the change in class balance.
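
A minimal Python sketch of Equation (2) follows. The function and argument names are chosen here for illustration, and the inputs (the counts $|T_M|$ and $|R|$, the validation precision, and the class balances $C_V$ and $C_T$) are assumed to have been measured or estimated beforehand. The sketch does not guard against a zero denominator.

```python
def adjusted_precision(prec_val, c_v, c_t):
    """Validation precision re-weighted from class balance c_v to class balance c_t."""
    numerator = c_t * (1.0 - c_v) * prec_val
    denominator = c_v * (1.0 - c_t) + (c_t - c_v) * prec_val
    return numerator / denominator


def precision_lower_bound(n_direct, n_pairs, prec_val, c_v, c_t):
    """Estimated lower bound of Equation (2): (|T_M| / |R|) times adjusted precision."""
    return (n_direct / n_pairs) * adjusted_precision(prec_val, c_v, c_t)


# Same class balance in validation and test: the adjustment leaves precision unchanged.
print(adjusted_precision(0.9, 0.5, 0.5))
# Matches are much rarer among test pairs than validation pairs: precision drops sharply.
print(adjusted_precision(0.9, 0.5, 0.01))
```

The second call shows why a balanced validation set can be misleading: a validation precision of 0.9 corresponds to roughly 0.083 once only 1% of candidate pairs are true matches.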


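Proof. From Lemma 1 and applying the definitions of pairwise precision for $R$ and $T_M$:

$$\begin{aligned}
E[\mathrm{Prec}(R, S)] &= E\left[\frac{|R \cap S|}{|R|}\right] \\
&\geq E\left[\frac{|T_M \cap S|}{|R|}\right] \\
&= \frac{|T_M|}{|R|}\, E[\mathrm{Prec}(T_M, S)] \\
&\geq \frac{|T_M|}{|R|} \left( \frac{C_T (1 - C_V)\, E[\mathrm{Prec}(V_M, V_S)]}{C_V (1 - C_T) + (C_T - C_V)\, E[\mathrm{Prec}(V_M, V_S)]} \right),
\end{aligned}$$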

where the last step follows from equating the match function validation set performance to the 
expected match function test set performance using change in match/mismatch class balance. □ 
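
One way to spell out this class-balance adjustment is sketched below. It assumes the match function's true-positive rate $t$ and false-positive rate $f$ are the same on validation pairs and test pairs; the symbols $t$, $f$, $P_V = E[\mathrm{Prec}(V_M, V_S)]$ and $P_T = E[\mathrm{Prec}(T_M, S)]$ are introduced here only for this illustration.

```latex
% Precision on the validation pairs, written with the rates t and f:
P_V = \frac{C_V\, t}{C_V\, t + (1 - C_V)\, f}
\quad\Longrightarrow\quad
\frac{f}{t} = \frac{C_V (1 - P_V)}{(1 - C_V)\, P_V}.

% The same rates applied to the test pairs, with class balance C_T:
P_T = \frac{C_T\, t}{C_T\, t + (1 - C_T)\, f}
    = \frac{C_T (1 - C_V)\, P_V}{C_T (1 - C_V)\, P_V + C_V (1 - C_T)(1 - P_V)}
    = \frac{C_T (1 - C_V)\, P_V}{C_V (1 - C_T) + (C_T - C_V)\, P_V}.
```

The final expression is the bracketed factor in Equation (2); when $C_T = C_V$ it reduces to $P_V$.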


Most of the values are straightforward to count from the resolution. $|R|$ is the number of pairs in
the clustering output. $|T_M|$ is the number of record pairs that directly match, which by Lemma 1 can be
efficiently computed as $\sum_{(r_1, r_2) \in R} (r_1 \approx r_2)$, where each direct match contributes 1 to the sum.

The class balance of the validation set $C_V$ is known, but we must estimate $C_T$. We refer the reader
to state-of-the-art results for class prior estimation.
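
The two counts can be obtained directly from a clustering, as in the following sketch. The helper names and the toy integer records with an adjacency-style `match` rule are assumptions for illustration only.

```python
from itertools import combinations
from math import comb


def count_pairs(clusters):
    """|R|: number of record pairs placed in the same cluster."""
    return sum(comb(len(cluster), 2) for cluster in clusters)


def count_direct_matches(clusters, match):
    """|T_M|: by Lemma 1 every direct match lies inside a cluster,
    so only within-cluster pairs need to be checked."""
    return sum(
        1
        for cluster in clusters
        for a, b in combinations(sorted(cluster), 2)
        if match(a, b)
    )


# Toy example: integer records that directly match when they differ by exactly 1.
clusters = [{1, 2, 3}, {4, 5}]
n_pairs = count_pairs(clusters)
n_direct = count_direct_matches(clusters, lambda a, b: abs(a - b) == 1)
print(n_direct, n_pairs)
```

Here the pair (1, 3) is in $R$ only through the chain 1, 2, 3, so it counts toward $|R|$ but not toward $|T_M|$.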

Theorem 2. The pairwise recall of an entity resolution result can be lower bounded by: 

$$E[\mathrm{Recall}(R, S)] \geq E[\mathrm{Recall}(V_M, V_S)]. \tag{3}$$

In other words, the recall on the validation set already forms a lower bound for the pairwise recall 
on the test resolution. 

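Proof. From the definitions of pairwise recall for $T_M$ and $R$ and then applying Lemma 1:

$$\begin{aligned}
E[\mathrm{Recall}(R, S)] &= E\left[\frac{|R \cap S|}{|S|}\right] \\
&\geq E\left[\frac{|T_M \cap S|}{|S|}\right] \\
&= E[\mathrm{Recall}(T_M, S)] \\
&= E[\mathrm{Recall}(V_M, V_S)],
\end{aligned}$$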

where the last step does not require class rebalancing because recall is not a function of class balance 
(unlike precision, it only depends on the positive pairs). □ 

A lower bound on pairwise $F_1$ (the harmonic mean of pairwise precision and recall) can be computed
with the two former lower bounds. We will focus more on measuring both pairwise precision and
recall as they are more informative than the aggregated $F_1$ metric.
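
Because the harmonic mean is non-decreasing in each of its (non-negative) arguments, substituting the two lower bounds yields a lower bound on $F_1$. A minimal sketch, with a function name chosen here for illustration:

```python
def f1_lower_bound(precision_lb, recall_lb):
    """Harmonic mean of the precision and recall lower bounds."""
    if precision_lb + recall_lb == 0:
        return 0.0
    return 2 * precision_lb * recall_lb / (precision_lb + recall_lb)


print(f1_lower_bound(0.8, 0.6))
```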

4 Optimal Merge Function 

Given any match function $m$ satisfying the idempotence and commutativity properties, we derive
a merge function and ‘wrapper’ match function that optimize the estimated lower bounds. Since the
idempotence property is trivially satisfied for any match function by checking for identical records
and the commutativity property is satisfied by checking both directions $r_1 \approx r_2$ and $r_2 \approx r_1$,
this essentially holds for all pairwise match functions. These match and merge functions form a
conservative strategy, but provide the lower bound optimal performance given only labeled pairs.

We consider the original set of records $R = \{r_1, \ldots, r_n\}$ and use notation $o$ for a record formed by
merging at least two other records.

Theorem 3. For any match function, the pairwise precision, recall, and $F_1$ estimated lower bounds
are optimal for the merge function:

$$\langle o_1, o_2 \rangle = \bigcup_{r_i \in o_1, o_2} r_i \tag{4}$$

The corresponding ‘wrapper’ match function between $o_1$ and $o_2$ is:

$$o_1 \approx o_2 = \max_{r_i \in o_1,\, r_j \in o_2} m(r_i, r_j). \tag{5}$$


Proof. We will show both directions, that the optimal merge function and match ‘wrapper’ must
make at least these matches to satisfy the ICAR properties, and that any additional matches will
decrease the estimated performance lower bound. By the definition of the set union operator, the
merge function is associative. The rest of the proof will focus on the representativity property.

Direction 1: We are constrained by match and merge functions that satisfy the ICAR properties. In
the first direction, we will show these are the minimum matches required to satisfy representativity.
Assume on the contrary: there exist two composite records $o_1$ and $o_2$, such that $o_1 \not\approx o_2$ but one
pair of their constituent records match, i.e. $r_i \approx r_j$, for some $r_i \in o_1$, $r_j \in o_2$. By definition, this
contradicts the representativity property.

Direction 2: In the second direction, we will show any additional matches will increase $|R|$ and thus
decrease the estimated pairwise precision lower bound. Assume there exist two records $o_1$ and $o_2$,
such that $o_1 \approx o_2$ but none of their constituent records match, i.e. $r_i \not\approx r_j$, $\forall r_i \in o_1, r_j \in o_2$. The
additional match $o_1 \approx o_2$ may increase $|R|$, thus decreasing $\mathrm{Prec}(R, S)$. □

The simplicity of this approach is derived from only claiming performance knowledge of direct
record matches from the validation set performance. Interestingly, this ER system is equivalent to
finding all connected components, where each edge $A_{ij} = r_i \approx r_j$ in the adjacency matrix $A$. We
stress that though this may optimize the estimated lower bound performances, it does not necessarily 
guarantee better performance. However, if ground truth is not available for a dataset of comparable 
size to the deployed system, then this is now a theoretically well motivated approach. 
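
The connected-components view can be sketched with a union-find structure, as below. This is an illustrative implementation, not the R-Swoosh procedure used in the experiments; it compares all record pairs, and the `resolve` name and the toy `match` rule are assumptions.

```python
def resolve(records, match):
    """Cluster records as the connected components of the direct-match graph."""
    parent = list(range(len(records)))

    def find(i):
        # Follow parent pointers to the root, halving the path along the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Add an edge (union the two components) for every directly matching pair.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if match(records[i], records[j]):
                parent[find(i)] = find(j)

    # Group records by the root of their component.
    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(records[i])
    return list(clusters.values())


# Toy example: integers directly match when they differ by at most 1.
result = resolve([1, 2, 3, 10, 11, 20], lambda a, b: abs(a - b) <= 1)
print(result)
```

Records 1 and 3 never match directly, yet they end up in the same cluster through record 2, which is the chaining behaviour described after Lemma 1.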

A significant benefit of Theorem 3 is that the provided match function need not satisfy the very restrictive
representativity property. Further, since the idempotence and commutativity properties are trivial
to satisfy, $m$ can be essentially any match function. For example, one could use more complex
machine learning based match functions (e.g. kernelized SVM, random forests) and featurizations
which may not have intuitive merge operations (e.g. word2vec [12], Brown clustering [13]). Using
less restrictive match functions undoubtedly enables better $\mathrm{Prec}(V_M, V_S)$ and $\mathrm{Recall}(V_M, V_S)$,
further improving the lower bounds.

5 Experiments 

We conducted experiments on multiple datasets with known ground truth to empirically demonstrate 
the tightness of the estimated lower bounds. Specifically, we are interested in optimizing ER model 
parameters over the estimated lower bounds and over the ground truth metrics to show they achieve 
similar results. 

5.1 Datasets 

We used one synthetic and three real world datasets with known ground truth for our experiments, as
described in Table 2. For all these datasets, the goal of entity resolution is to find records describing
the same entity (e.g. restaurant, product, or person). For the synthetic dataset, we generated each 
record’s features using a feature vector unique to its respective entity, plus random Gaussian noise. 

Unlike general machine learning tasks, publicly available entity resolution datasets with known
ground truth are extremely limited, and do not number beyond several thousand records. The restaurant
dataset is one of the earliest ER tasks discussed in the literature [14], and is still used today [9].
Unfortunately, the dataset is also relatively small - numbering only 864 records and five features
(name, phone number, street address, city, cuisine). We discarded the phone number feature because
it made the problem too simple. The Abt-Buy dataset is more recent, larger at 2173 records,
and used extensively in current research [9, 16]. It consists of product information from two retailers,
including product name, description, and price.

Both the Restaurant and Abt-Buy datasets are a class of entity resolution known as clean-clean, 
wherein two ‘clean’ datasets with completely resolved entities are merged together. This problem
is easier than the more general problem of resolving entities with an unknown number of records. 
To formulate these datasets in a more general context, we merged them together into a single ‘dirty’ 
dataset and ignored the advantageous ‘clean-clean’ knowledge in our experiments. 

Lastly, we evaluated a subset of a personal ads dataset scraped from escort advertising websites over
the past few years. We used natural-language-processing algorithms to extract 20 features, such
as name, age, location, and hair color of the person being advertised. For ground truth, we used a 
subset of the data containing phone number matches as a proxy label. Although phone numbers will 
not allow us to discover the full ground truth, it is reasonable to assume ads with the same phone 


Table 2: Datasets used in the experiments

Dataset           # dim   # records   # matches
Synthetic         10      1000        4500
Restaurant¹       4       864         112
Abt-Buy²          3       2173        1118
Escort (subset)   20      10000       10596

¹ http://www.cs.utexas.edu/users/ml/riddle/data/restaurant.tar.gz
² http://dbs.uni-leipzig.de/file/Abt-Buy.zip

[Figure 2 panels, each plotted against Match Threshold: (a) Synthetic precision, (b) Synthetic recall,
(c) Restaurant precision, (d) Restaurant recall, (e) Abt-Buy precision, (f) Abt-Buy recall,
(g) Escort precision, (h) Escort recall]

Figure 2: Experimental results demonstrate model parameters can be tuned to optimize estimated
lower bound pairwise precision and recall of the test set. The resulting estimated lower bound is
close to the true performance. Pairwise $F_1$ is not shown because it is the harmonic mean of the two
former metrics, and is thus less informative.

number belong to the same entity (i.e. person or group) because those numbers are the means of
contact for potential customers. 

5.2 Entity Resolution 

We used the R-Swoosh algorithm for our ER systems. For the merge function, we simply used
the set union of the respective features. For example, in Table 1, $\langle r_1, r_2 \rangle$ would be [{J., John}, {D.,
Doe}, {377-8328}].

For the match function, we trained a binary logistic regression classifier using known matches and 
mismatches in the training dataset. Like all pairwise entity resolution algorithms, it operates on
pairwise features, which we computed from two records’ features using either a binary match (e.g.
state, hair color), numerical difference (e.g. ages, weights), or Levenshtein string edit distance (e.g.
name) of each feature pair. If a record had multiple values of a particular feature from a merge operation,
we used the closest feature match.
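
The pairwise featurization described above might look like the following sketch. The field names (`state`, `age`, `name`), the record layout (each field is a set of values, so merged records can hold several), and the helper names are hypothetical; the sketch assumes every field holds at least one value.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + (char_a != char_b),  # substitution
            ))
        previous = current
    return previous[-1]


def closest(values_a, values_b, distance):
    """Smallest distance over all value pairs (the 'closest feature match')."""
    return min(distance(x, y) for x in values_a for y in values_b)


def pairwise_features(rec_a, rec_b):
    """One feature per field: binary mismatch, numerical difference, edit distance."""
    return [
        closest(rec_a["state"], rec_b["state"], lambda x, y: float(x != y)),
        closest(rec_a["age"], rec_b["age"], lambda x, y: abs(x - y)),
        closest(rec_a["name"], rec_b["name"], levenshtein),
    ]


merged = {"state": {"CA"}, "age": {25}, "name": {"John", "J."}}
single = {"state": {"CA"}, "age": {27}, "name": {"Jon"}}
features = pairwise_features(merged, single)
print(features)
```

The resulting feature vector would then be passed to the trained classifier; that step is omitted here.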

Another benefit of using a probabilistic match function is the choice of parameters is reduced to a 
single value: the cut-off threshold. The choice of cut-off threshold is a classic trade-off between 
precision and recall - an ideal setting to examine the results of our bounds. 

5.3 Results 

To examine the efficacy of the estimated lower bound in tuning an entity resolution system, we
evaluated the true and lower bound performances across tightly spaced intervals of match cut-off
thresholds, as shown in Figure 2. The tightness of the bounds demonstrates two important qualities.
First, they enable the optimization of model parameters (e.g. cut-off threshold) using the estimated 
lower bound. Though this may not necessarily result in the true (unknown) optimal parameters, it 
will result in the best estimated lower bound. Second, it enables enforcing a level of acceptable 
quality prior to the use of any entity resolution results. 

One may be surprised to see the estimated lower bound exceed the true performance. This is,
indeed, possible because of uncertainty in estimations of $\mathrm{Prec}(V_M, V_S)$, $\mathrm{Recall}(V_M, V_S)$ and $C_T$.
The 95% confidence intervals are obtained via the propagation of validation set Wilson scores for
precision and recall [18]. Uncertainty increases as the gap between validation set and test set sizes
widens, a phenomenon observable in Figure 1. For very small datasets such as Restaurant, we were
restricted to using minimal validation samples due to the small number of labels. However, for 
larger experiments such as Abt-Buy and Escort, we could afford hundreds or thousands of validation 
samples, significantly reducing uncertainty. This is also theoretically motivated by the shift in class 
balance in Theorem 1. 
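
For reference, the Wilson score interval for a single proportion (such as validation precision or recall) can be computed as in the sketch below. How the intervals are propagated through the bounds is not specified here, so the sketch covers only the interval itself; the function name and the fallback for zero trials are assumptions.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if trials == 0:
        return (0.0, 1.0)  # no data: fall back to the trivial interval (assumption)
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    center = (p_hat + z * z / (2 * trials)) / denom
    half_width = z * sqrt(
        p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials)
    ) / denom
    return (center - half_width, center + half_width)


# 90 correct out of 100 validation pairs versus 900 out of 1000.
low_small, high_small = wilson_interval(90, 100)
low_large, high_large = wilson_interval(900, 1000)
print(low_small, high_small)
print(low_large, high_large)
```

The interval narrows as the number of validation samples grows, which is consistent with the reduced uncertainty reported for the larger experiments.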

The four experiments demonstrate different ER behavior. The synthetic experiment has a narrow
range of model parameters with perfect precision and recall, where performance degrades dramatically
outside this range. The Restaurant experiment has a more gradual tradeoff between precision
and recall, though there is a significant uncertainty in the lower bound estimate due to the limited
number of validation samples. Precision in Abt-Buy quickly degrades, though recall is much more
gradual. Our bounds correctly capture the need to improve the underlying ER systems for the Abt-Buy
and Escort datasets. Without this lower bound, the poor performance on larger datasets would
not be evident from smaller tests.

6 Conclusions 

Performance optimization of scalable entity resolution systems is challenging because unlike other 
machine learning tasks, there is not a clear understanding of how behavior will change on larger 
datasets. In this paper, we developed a simple - yet effective - method for optimizing lower bound 
performance using a small set of labeled pairs. 

Further, we showed the optimal lower bound strategy for any match function is the connected components
problem from graph theory - a relatively conservative clustering approach compared to
many ER systems. We understand that this does not necessarily guarantee better performance, but
it does provide a better lower-bound guarantee. For instance, in our original example in Table 1, $r_3$
would have matched neither $r_1$ nor $r_2$. However, when labeled datasets of comparable size to the
deployed system are not available, this is now a theoretically well motivated approach.

Our bounds specifically addressed performance of pairwise entity resolution algorithms satisfying
the ICAR properties. Pairwise algorithms are intuitive, easy to implement, and perform an
optimal number of pairwise record comparisons. However, they are also only a subset of entity
resolution approaches. Further, we only considered pairwise precision, recall,
and $F_1$ due to their popularity, intuitive interpretation and mathematical convenience, though other
existing metrics have been shown to produce conflicting rankings [2].

Estimating the lower bounds relies on accurate estimations of several other quantities, including 
recall and precision on the validation set and class prevalence estimation in the test set. Especially 
as datasets scale to much larger sizes, our bounds rely on these estimates. As evident in Theorem 1
and in our experiments, uncertainty increases as the gap between validation and testing set sizes
widens.